A 



ICC 2017 

WASHINGTON 
JULY 02-07, 2017 


Towards Seamless Validation of Land Cover Data 


Ekaterina Chuprikova \ Lukas Liebel 2 , and Liqiu Meng 3 

1. Chair of Cartography, Technical University of Munich, Germany; e.chuprikova@tum.de 

2. Chair of Remote Sensing Technology, Technical University of Munich; lukas.liebel@tum.de 

3. Chair of Cartography, Technical University of Munich, Germany; liqiu.meng@bv.tum.de 


Abstract: This article demonstrates the ability of the Bayesian Network analysis for the recognition of uncertainty 
patterns associated with the fusion of various land cover data sets including GlobeLand30, CORINE (CLC2006, Ger- 
many) and land cover data derived from Volunteered Geographic Information (VGI) such as Open Street Map (OSM). 
The results of recognition are expressed as probability and uncertainty maps which can be regarded as a by-product 
of the GlobeLand30 data. The uncertainty information may guide the quality improvement of GlobeLand30 by in- 
volving the ground truth data, information with superior quality, the know-how of experts and the crowd intelligence. 
Such an endeavor aims to pave a way towards a seamless validation of global land cover data on the one hand and a 
targeted knowledge discovery in areas with higher uncertainty values on the other hand. 
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1. Introduction 

Global land cover (GLC) is an important data source for environmental applications that helps us to understand the 
dynamic of our changing planet (Ban et al. 2015). Despite the development of remote sensing technology, the task of 
mapping and monitoring of our environment is still very challenging. Numerous efforts have led to GLC datasets of 
different resolutions such as GLC2000 (1km), GlobCover (300m), GLCF water mask (250m) and the latest publicly 
available GlobeLand30 (30m) produced by National Geomatics Center of China. Recent studies within the land cover 
community are increasingly committed to the validation of GlobeLand30 data (Brovelli et al. 2015, Sun et al. 2015, 
Arsanjani et al. 2016). The validation results revealed an overall accuracy between 46% and 83% for different study 
areas. Obviously some areas have higher degree of uncertainty than other areas, which necessitates further quality 
improvements of the GlobeLand30 data. 

Due to multiple error sources and inaccuracies involved in data acquisition, processing and classification, the as- 
signed land cover classes may differ from the ground truth and therefore they are uncertain. It is crucial to account for 
uncertainty that is useful for the evaluation of the data quality, decision-making practices and the creation of a general 
awareness. Apart from numerous definitions, we treat the uncertainty in this article as an indicator of distrust between 
the classified data in referenced datasets. Consequently, the higher the uncertainty, the more likely the actual land 
cover class of a particular location needs to be validated. Our work aims to model the GlobalLand30 data uncertainty 
based on probabilistic methods and to create a visualization support for the data refinement. We use the Bayesian 
Network framework for the uncertainty analysis of GlobeLand30 data. The Bayesian Networks are probabilistic 
graphical models that utilize probabilities obtained from the auxiliary data or expert notions to determine quantitative 
values of the uncertainty. In this work we focus on areas or GlobeLand30 classes that have higher degree of uncer- 
tainty. Volunteered geographic information (VGI), in particular, the Open Street Map (OSM) is adopted as an inex- 
pensive auxiliary data source. In order to adapt the extracted dataset for our use case, we conducted a comprehensive 
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analysis of the suitable OSM tags and assigned the closest GlobeLand30 class code to the OSM tags in our study area. 
The resulting maps will then visually guide the experts or non-expert users to explore the areas where a high degree 
of mismatch between the GlobeLand30 results and OSM information occurs. 


2. Study Area and Data 

This research investigates uncertain areas of the Global Land Cover classification, adopting the datasets of 
GlobeLand30, CORINE and Volunteered Geographic data based on OpenStreetMap for the area of Upper Bavaria 
(German: Oberbayern), Germany. Even though, none of the reference data sets is ground truth, we perform the data 
fusion in order to reveal variations in the classification. The test area of Upper Bavaria was selected due to availability 
of high quality datasets. This could reduce the complexity of the analysis and provide a basis for the study case. 
Moreover, the test area reveals a sufficient diversity in distribution of land classes, thus provides a solid background 
for exploring probability of each land class. Although the three datasets are different from each other, they are unified 
to the resolution of 30m and land cover classification based on GlobeLand30 schema. 


2.1 GlobeLand30 

In 2010 China launched a project with the aim to identify global land cover classes in resolution of 30 meters. The 
GlobeLand30 data set is freely available and comprise 10 major classes of land cover, including cultivated areas, 
forests, grassland, shrub land, wetland, water bodies, tundra, artificial surfaces, bare land and permanent snow and 
ice. The classification is available for two base-line years, 2000 and 2010. The GlobeLand30 was produced based on 
more than 20,000 Landsat and Chinese HJ-1 satellite images (see www.globallandcover.com). Previous studies (Chen 
et al. 2015) indicate that the validation based on comparison with other data sets achieved an overall classification 
accuracy of over 80%. In the same time the GlobeLand30 data may show significant variation in different parts of the 
world and in some remote areas data presents the overall accuracy of 46% (Sun et al. 2015). Such a difference is 
mainly due to the difficulty to distinguish some classes such as forest, shrub land and grassland as well as the low 
availability of reference data. For this reason, our approach will involve auxiliary data, including VGI data, in order 
to support analysis of uncertain areas. 


2.2 Open Street Map Land Cover 

Since the term Volunteered Geographic Information (VGI) was introduced by Goodchild (2007), knowledgeable 
amateurs have contributed large amounts of spatially referenced data to different web portals. The OpenStreetMap 
(OSM) (openstreetmap.org) is, without any doubt, the most wide spread and well recognized project. The database 
comprises vector data, which is attributed with a great variety of labels and might serve as source data for various 
cartographic products. Since every contributor can freely edit the database without supervision, the OSM data is het- 
erogeneous in terms of quantity and quality. Assessing the accuracy of the OSM is, therefore, an essential task to 
facility the scientific usage of this data source. Several studies have reported some encouraging results in terms of the 
overall accuracy and completeness (Helbich et al. 2012, Neis et al. 2012). 

The validation of land cover classification requires manually labeled high-quality ground truth data for accuracy 
assessment. VGI -based approaches have been proposed and online communities, such as GEO-Wiki (Fritz et al. 2009, 
geo-wiki.org) are contributing data, specifically labeled for this task. Thus far, however, the coverage of the contrib- 
uted data does not allow for exhaustive validation of large region datasets. In contrast to that, the OSM contributors 
are very actively collecting a broad range of thematic data with close to complete spatial coverage in certain areas 
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(Ribeiro and Fonte 2015). Thus, utilizing the OSM as a source for land cover ground truth data is a promising ap- 
proach. Since the OSM data is not specifically tailored to the needs of land cover map validation, various methods for 
transforming the original data to a more suitable representation have been developed (Fonte et al. 2015). While only 
a portion of the OSM attributes is valuable for a derived land cover map, the coverage is still high enough to be usable, 
especially in urban areas (Ribeiro and Fonte 2015). 
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Fig. 1 . Workflow used for creating land cover map from OSM data. 


In our research, we implemented a method for deriving a land cover reference map from the OSM database as 
shown in Fig. 1. In order to preserve the entire content of the database, we use a complete XML-encoded extract of 
the OSM database, representing our study area, instead of pre-processed Shapefiles, as suggested by Fonte et al. 
(2016). For an efficient processing of the large data amount, we use a PostGIS database for our experiments. For the 
derivation of the land cover map, a subset of the OSM tags, namely “amenity”, “building”, “historic”, “land use”, 
“leisure”, “natural”, “shop”, “tourism”, and “waterway” is considered. We define a mapping from the OSM attributes 
to the classes used in the GLC30 classification scheme. The mapping is only conducted for polygon features, since 
point and line features do not provide immediate information about the coverage of an area. Exploiting additional 
information implicitly contained in point and line features, might be possible in general, using assumptions about 
specific feature classes, such as empirically determined road widths. In order to keep the used data as noise-free as 
possible, this was omitted in our experiments. In a final step, the vector data is rasterized to a 30 m grid with an 
appropriate Minimum Mapping Unit (MMU), merging small features with their neighboring features. The final OSM 
land cover map of our study area is shown in Fig. 2. The total coverage of our reference map is about 71% of the total 
study area. 
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Legend 

Border of Upper Bavaria 
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Fig. 2. Land cover map derived from XML-encoded extract of the OSM database. 


2.3 CORINE Land Cover - CLC2006 

The land cover mapping for the countries of European Union is realised within the programme CORINE (“Coor- 
dination of Information on the Environment”). Based on CORINE Land Cover 2000 for Germany, the data was up- 
dated and land cover product CLC2006 was produced as a vector database. According to Keil et al. (201 1) the main 
data sources of the land cover and land use mapping were satellite images of Landsat 7 (for 2000) and IRS-P6 LISS 
III as well as SPOT-4 and SPOT-5. This product is following common European wide CLC nomenclature and consists 
of 44 classes, where 37 classes are relevant for Germany. Therefore, CLC2006 is characterized as result of the GIS 
derivation considering the land cover changes. The CLC2006 has 25m of minimum mapping unit (MMU) for the 
polygons. The data is released in the projections of Gauss-Kruger Zone 3, Gauss-Kruger Zone 4 or UTM Zone 32. 

Several authors (Gallego 2001, Brovelli et al. 2015, Arsanjani et al. 2016) have proposed to assess the land cover 
quality using CORINE datasets for different study areas. Arsanjani et al. (2016) studied the validation of GlobeLand30 
against CORINE for the area of Germany and showed that overall accuracy is 92%. Another solution was described 
in Brovelli et al. (2015) who indicated that the overall accuracy values of third level CORINE Land Cover are gener- 
ally higher than 80%. In both cases the validation was made using confusion matrix that represents comparisons of a 
land cover map against the referenced dataset. In our study CLC2006, further called CORINE, was used as a variable 
for constructing Bayesian Network. Therefore, the polygon map CLC2006 was gridded for use in the model with the 
grid cell larger than MMU. The resolution of the gridded map is 30m. Moreover, we scaled down the class complexity 
in order to provide consistent classification for all the data sources. That is to say, 44 CORINE land cover classes were 
assigned to 10 classes in line with the GlobeLand30 classification (see the Table 1). 
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Table. 1. CORINE (CLC2006) reclassification based on the GlobeLand30 land cover definition. 


GlobeLand30 land cover classification 

CORINE 

Pixel values 

GlobeLand30 

Pixel values 

Cultivated 

32-41 

10 

Forest 

42-45 

20 

Grassland 

46-47 

30 

Scrubland 

48-49 

40 

Wetland 

55-59 

50 

Water bodies 

60-64 

60 

Tundra 

- 

70 

Artificial surface 

21-31 

80 

Bare land 

50-53 

90 

Permanent ice and snow 

54 

100 


3. Methodology 

Land cover classes derived from the satellite imageries are commonly used for various environmental studies. 
However, due to errors involved in the data acquisition and processing, uncertainties are introduced. The uncertainties 
of classified remote sensing data can be measured using different approaches such as comparison against ground truth, 
analysis of classification statistics (e.g., measures of separation in spectral space), comparisons between different land 
cover products (Quaife and Cripps 2016), or using an alternative method that involves Internet users for the data 
validation (Fritz et al., 2009). The scientific community proposes various methods to validate the classes and to model 
the quality and uncertainty. This article provides an approach of Bayesian Networks, which is widely used in diverse 
scientific domains such as medicine, weather forecasting and social science. Bayesian Networks have the property to 
address the spatial and temporal complexity of the dataset and produce reasoning based on evidences. Recently, Bayes- 
ian Networks are gaining importance for the quantitative and qualitative analysis of the remote sensing data (Quaife 
and Cripps 2016) because they are able to handle measurable information as well as qualitative criteria such as expert 
opinion or preferences based on the contribution of wide public of non-experts. 

The underlying mathematical model of a Bayesian network is based on components such as Directed Acyclic 
Graphs (DAG) and conditional probability tables (CPTs) (Darwiche 2008). The nodes in DAG represent random 
variables, such as land cover classifications, and arrows among them describe dependencies among these variables. 
The process is organised in one-way direction, so that the child doesn’t transfer any feedback to the parent. On this 
note, the Bayesian approach is able to model the proportions of true values in selected pixels at each location across 
the whole study area. The data analysis might include integration of multiple land cover data, auxiliary data sets and 
expert knowledge representing the data of same nature and study area using consistent manner. The output of such 
evaluation consists of the probability maps and uncertainty information. Furthermore, this data might be visualized 
via geoportal, where the land cover data is disseminated or by means of stand-alone visualizations. 



ICC 2017: Proceedings of the 2017 International Cartographic Conference, Washington D.C. 


6 


Data 



OSM & CORINE 
& GLC30 



Data filtering 
Resolution 
Coordinate system 
Land cover classes 



Results 


llw 


Probability maps 




Uncertainty 
(Shannon Index) 


Fig. 3. Pipeline for using Bayesian Network. 

The procedure for using Bayesian Network is depicted in Fig. 3. The graphical model defines the probability rela- 
tions among the variables. In this research we treat the land cover classification GlobeLand30, CORINE and land 
cover derived from VGI data, namely OSM, as variable for the constructing Bayesian Network. We assume that all 
these three datasets are potentially misclassified, therefore each classification can be described in relation with other, 
assigning prior probabilities of each class. If we consider one of the variable (land cover classification) that doesn’t 
have a parent node, the prior probability for this variable is assigned based on the possible values: P = xf), where 
x lt ...,x n are all possible values (i.e., instantiations) of variable X t . The nodes at the next level have parents, therefore 
the conditional probability is assigned. Hence, each variable is described by the probability under the state of its parent. 
If x t indicates the values of variable X t and pa t indicated the set of values for X? s parents, then P{x t \ paf) indicates 
the conditional probability, for example P (X 3 = x 3 \X 1 = x lt X 2 = x 2 ). 

Based on the probability theory the conditional probability is 

P(a\b) = E(a|b)/P(b) or P(a,b) = P(a\b) • P(b), (1) 

Where P (a, b ) is a joint probability of event a A b. 

Therefore, 


P(a \b) • P(b) = P(b\a)P(a ) (2) 

P(a\b)= P(b\a) • P(a)/P(b) (3) 


Equation 3 is the main concept supporting the Bayesian Networks modelling. Using this equation the probabilities 
of event P(a) can be updated based on the new evidence related to event b. Hence, based on the Bayesian theory it is 
possible to update the knowledge of a land cover class considering new/additional evidence (conditional probability). 
The reasoning is based on the degree of belief (posteriori probability). 
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Fig. 4. Probability maps of cultivated and artificial land cover classes. 
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Fig. 5. Uncertainty (Shannon Index) map based on the data fusion of GlobeLand30, CORINE and land cover derived from OSM data. 
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Moreover, the effectiveness of the Bayesian Networks lies also in the ability to compute the conditional probability 
of the descendant nodes as well as parent nodes. One of the challenges when applying BN is to define the probability 
functions. This can be made via discretization process where the values of a variable are represented by discrete 
quantities. Therefore the range of the values is split into intervals defined by the land cover classification. The simu- 
lation was realized using R statistics and package for the spatial implementation of Bayesian Networks and mapping 
“bnspatial”. The outputs of the simulation processes are the expected state of target node (i.e. the state with the highest 
relative probability) and the uncertainty, expressed as entropy by the Shannon index. The example of the output maps 
is illustrated in the Fig. 4 and 5. 


4. Results 

The Bayesian Networks are an effective technique for the analysis of remote sensing data, especially for the rea- 
soning based on diverse sources with varying degrees of reliability. The outputs of this research work are posterior 
probability maps, and the map of uncertainty measured as Shannon index (entropy) (see the figure 3). The maps of 
posterior probability depict updated prior probability of land cover class occurrence considering the information from 
different input datasets. The map of uncertainty was elaborated to depict the diversity in the data and it illustrates 
Shannon entropy applied for the land cover classification. Therefore, the latter map shows the amount of various 
information that each location might be assigned to. The aim of the output maps is to highlight the areas with the 
highest degree of the ambiguity and provide visual guidance for the further land cover validation as well as for a 
deeper understanding of the dynamics related to the high uncertainty. As it can be seen from the Fig. 6, the uncertainty 
of a polygon is high, and when we compare to the original data sets, it is evident that the highlighted area was defined 
in inconsistent manner. Hence, the attention for the validation should be placed at such areas. Moreover, such data 
may be utilized to guide the implementation of decision support tools. 


Uncertainty (Shannon Index) 


OSM Land Cover 



□ Cultivated Land 

■ Forest 

□ Grassland 

□ Shrub Land 

□ Wetland 

□ Water Bodies 
■I Artificial Surface 

■ Bare Land 

□ Permanent Ice and Snow 


GlobeLand30 


CORINE (CLC2006) 


Legend 
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Fig. 6. Interpretation of uncertain areas based on land cover classification from GlobeLand30, CORINE (CLC2006) and Open Street. 

Map. 
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5. Conclusion 

The proposed approach offers a technique for the identification of uncertain areas associated with the output from 
each additional source that can be further visualized within a geoportal of GlobeLand30. The available uncertainty 
information may guide the quality improvement of GlobeLand30 by involving the ground truth data, information with 
superior quality, the know-how of experts and the crowd intelligence. This may finally pave a way towards a seamless 
validation of global land cover data, which is beyond the current validation approach by image processing experts 
using a limited amount of sample data for selected regions. Moreover, it will trigger a targeted knowledge discovery 
in the areas with higher uncertainty values. 

Based on the results we see a possible solution in using statistical methods of data fusion to discover uncertain 
information. The originality of our approach lies in using the Bayesian Networks technique for analysis of global land 
cover data along with the data derived from VGI, namely OSM. This study shows that Bayesian Networks with mul- 
tiple datasets have potential to reveal the hidden patterns that cannot be found by linear processing. However, the 
findings have number of possible limitations, namely the quality of existent reference data, quality of volunteered 
contributions to the OSM and prior expert knowledge about the used data. Therefore, due to the mentioned shortcom- 
ings, we cannot claim that the results highlight misclassified areas. However, the discovered patterns show higher 
degree of entropy and enable visual guidance for data exploration. The data obtained indicate that the more precise 
the prior expert knowledge and the higher data quality of the referenced datasets, the better results could be achieved. 
Therefore, under ideal conditions, the GLC classification could be significantly improved. 

Further research is needed to develop an application for the visual and analytical reasoning under uncertainty of 
land cover classification. Much research addressed the issue of data uncertainty and its signification. Therefore dif- 
ferent visualization techniques were implemented and tested for a variety of data types. However, it was addressed by 
MacEachren (2015) that just little attention has been given to reasoning/decision-making under uncertainty. The 
Bayesian approach can integrate different GIS layers and the expert knowledge to analyse probability of each class 
occurrence and its related uncertainty. Therefore beyond the data fusion and data visualisation, the further work will 
contribute with the decision-making environment for the remotely-sensed data analysis. 
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